Cuda/NVML Components: Dynamically Search for the Shared Objects #347
Merged
Treece-Burgess merged 1 commit into icl-utk-edu:master on Apr 30, 2025
Hello @scaronni, I have updated the Cuda and NVML component code to search for variations of the different shared objects. If you have time to review or test the code and notice any issues, please let me know!
I am reviewing this PR.
tokey-tahmid approved these changes on Apr 29, 2025
Approving the PR after performing the following tests:
- Methane (1 * A100) + cuda/12.5.1
  - PAPI Utilities: ✅
  - Cuda tests: ✅
- Hexane (1 * H100 && 1 * V100) + cuda/12.5.1
  - PAPI Utilities: ✅
  - Cuda tests: ✅
- Guyot (8 * A100) + cuda/12.5.1
  - PAPI Utilities: ✅
  - Cuda tests: ✅
…udart.so, libcudart.so.1 or libcudart (catch all).
Pull Request Description
This PR updates the closed PR #328, which requested using numbered versions of the shared objects at runtime instead of unnumbered ones. Instead of hard-coding the numbered versions, we now search for the shared objects dynamically.
For `libcuda`, `libcudart`, `libnvperf_host`, `libcupti`, and `libnvidia-ml`, there will be three naming schemes searched for:
- `libcudart.so`
- `libcudart.so.1`
- `libcudart` (for `libcudart` this would catch either `libcudart.so.12` or `libcudart.so.12.5.82`)

Testing was done on Methane at ICL (1 * A100) and Athena at Oregon (4 * A100s) using the PAPI utilities to verify:
- `PAPI_CUDA_ROOT` to Cuda Toolkit install directory: ✅
- (`PAPI_CUDA_RUNTIME`, `PAPI_CUDA_CUPTI`, `PAPI_CUDA_PERFWORKS`, and `PAPI_NVML_MAIN`): ✅
- so variations:
  - `libcudart.so` and `libcupti.so`: Found numbered version
  - `libnvidia-ml.so`: Found numbered version

Note: Removed the function `linked_cuda_rt` as this did not function properly and would return `PAPI_EMISC`. Removing the function did not seem to alter functionality from testing.

Author Checklist
- Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
- Commits are self contained and only do one thing
- Commits have a header of the form: `module: short description`
- Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
- The PR needs to pass all the tests